<style type="text/css"> @import url('https://fonts.googleapis.com/css2?family=Bebas+Neue:wght@400;700&display=swap'); pre { display: block; font-family: monospace; white-space: pre; margin: 1em 0px; margin-top: 0em; margin-right: 0px; margin-bottom: 0em; margin-left: 0px; white-space: pre-wrap; } p { line-height: 22px; } h1{ margin-bottom: -20px; } h2{ margin-bottom: -11px; } h3{ margin-bottom: -10px; } .emphasized { font-size: 1.2em; font-family: Nunito; } .greenemph { font-size: 1.2em; font-family: Nunito; color: #69995D } .greenhead { font-size: 35px; font-family: Nunito; color: #69995D } .nunito { font-size: 25px; font-family: Nunito; } .nunitosm { font-family: Nunito; } .nunitosmgrey { font-family: Nunito; color: #E8E9E8 } .invisible { color: white; } .remark-code { background: #green; } .remark-slide-number { font-size: 10pt; margin-bottom: -11.6px; margin-right: 10px; color: #FFFFFF; /* white */ opacity: 0; /* default: 0.5 */ } </style> <br> <br> <br> <br> <br> <br> <br> .center[ # DATA VISUALIZATION .nunito[LECTURE 2] .greenhead[INTRO TO R PROGRAMMING] **MARIA MONTOYA-AGUIRRE** M1 APE @ PARIS SCHOOL OF ECONOMICS ] --- # HOMEWORK REVIEW ✅️Inspect the IMDB dataset `glimpse()` `table()` and describe 5 issues we should solve in the data cleaning/tidying process that we haven’t addressed in this lecture. Write down the name of at least one function that would help for this. Google if you need! - The year variable is not numeric and has string characters. Some functions: `str_length()`, `str_sub()`, `as.numeric()` ✅️ Are older movies better? 1) Fix the year variable. This can be done in multiple ways! Try with `str_length()` and `str_sub()` 2) Create a variable that indicates whether a movie is older or more recent than the average year in the sample. 3) Create a table with the average rating by decade (start from the 50s). 
--- ## EXERCISE 1 ✅️Inspect the IMDB dataset `glimpse()` `table()` and describe 5 issues we should solve in the data cleaning/tidying process that we haven’t addressed in this lecture. Write down the name of at least one function that would help for this. Google if you need! -- ```r imdb <- read.csv(here("data","01_imdb-top250-french.csv"), encoding = "UTF-8") glimpse(imdb) ``` -- ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… ## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… ## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 8.3, 8.2, 8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … ## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… ## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… ## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` --- ### EXAMPLE ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … *## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… ## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… ## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 8.3, 8.2, 
8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … ## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… ## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… ## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` -- - The `Year`variable is not numeric and has strings that should not be there "-" "()" -- `str_length()`, `str_sub()`, `as.numeric()` --- ### 1 ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… *## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… ## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 8.3, 8.2, 8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … ## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… ## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… ## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` -- - There are missing values in the `Type` variable that are not coded as missing but are instead only blank `""`. 
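A minimal sketch of how this could be fixed, assuming a toy data frame that stands in for `imdb` (the film names and values below are made up for illustration). Blank strings are recoded as proper missing values with `if_else()` and `NA_character_`:

```r
library(dplyr)

# Hypothetical stand-in for the IMDB data: two ratings and one blank
toy <- data.frame(Name = c("Film A", "Film B", "Film C"),
                  Type = c("PG", "", "15"))

toy_clean <- toy %>%
  # Replace the empty string with a real missing value
  mutate(Type = if_else(Type == "", NA_character_, Type))

# Count how many values are now coded as missing
sum(is.na(toy_clean$Type))
```

`na_if(Type, "")` from {dplyr} is a shorter alternative that should give the same result here.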
-- `if_else()`, `NA_character_` --- ### 2 ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… ## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… *## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 8.3, 8.2, 8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … ## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… ## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… ## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` -- - The `Genre` variable is not useful as it is. It should be categorical variable or a dummy. 
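One possible way to do this, sketched on a toy data frame (the `Genre` values are illustrative and the new column names `is_documentary` and `main_genre` are hypothetical). `str_detect()` builds a dummy, and `case_when()` builds a single categorical variable:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the Genre column
toy <- data.frame(Genre = c("Documentary, History, War",
                            "Documentary",
                            "Comedy, Drama"))

toy_genre <- toy %>%
  mutate(
    # Dummy: TRUE if "Documentary" appears anywhere in the string
    is_documentary = str_detect(Genre, "Documentary"),
    # Categorical variable: the first matching condition wins
    # (.default needs dplyr 1.1.0 or later)
    main_genre = case_when(str_detect(Genre, "Documentary") ~ "Documentary",
                           str_detect(Genre, "Drama") ~ "Drama",
                           .default = "Other")
  )

toy_genre
```

The order of the conditions matters: a film listed as both a documentary and a drama would be classified as "Documentary" in this sketch.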
-- `str_detect()`, `case_when()` --- ### 3 & 4 ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… ## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… ## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 8.3, 8.2, 8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … *## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… ## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… ## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` -- ``` ## [1] "Director:\nClaude Lanzmann\n | \n Stars:\nSimon Srebnik, \nMichael Podchlebnik, \nMotke Zaïdl, \nHanna Zaïdl" ``` -- - The variable `Director_Stars` stores two variables in one column, it should be split -- `separate_wider_delim()` -- - The variable Director_Stars should not have the variable name in the content -- `gsub()`, `str_remove()` --- ### 5 ``` ## Rows: 250 ## Columns: 13 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Name <chr> "Shoah", "Home", "Untouchable", "Le Trou", "The Man Who… ## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … ## $ Year <chr> "-1985", "(I) (2009)", "-2011", "-1960", "-1987", "-194… ## $ Type <chr> "PG", "U", "15", "A", "", "A", "15", "15", "A", "U", "U… ## $ Duration <chr> "566 min", "118 min", "112 min", "131 min", "30 min", "… ## $ Genre <chr> "Documentary, History, War", "Documentary", "Biography,… ## $ Rating <dbl> 8.7, 8.5, 8.5, 8.5, 8.5, 8.3, 8.3, 
8.3, 8.2, 8.2, 8.2, … ## $ MetaScore <int> 99, 47, 57, NA, NA, 96, 69, 80, NA, NA, NA, 86, 99, NA,… ## $ Desc <chr> "Claude Lanzmann's epic documentary recounts the story … ## $ Director_Stars <chr> "Director:\nClaude Lanzmann\n | \n S… *## $ Votes <chr> "9,984", "", "", "", "", "", "", "", "", "", "", "", ""… *## $ Gross <chr> "$0.02M", "", "", "", "", "", "", "", "", "", "", "", "… ``` - All the values in the `Gross` and `Votes` variables are missing `select()` --- ## EXERCISE 2 1) Fix the year variable. This can be done in multiple ways! Try with `str_length()` and `str_sub()` ```r imdb$Year[1:50] ``` ``` ## [1] "-1985" "(I) (2009)" "-2011" "-1960" "-1987" ## [6] "-1945" "-2001" "-2010" "-1962" "-1956" ## [11] "-1902" "-1969" "-1969" "-1995" "-1959" ## [16] "-1994" "-2019" "-1953" "-1966" "(I) (2014)" ## [21] "-1928" "-1967" "-1937" "-1955" "-2007" ## [26] "-2007" "-1955" "-1952" "-1987" "-1962" ## [31] "-1939" "-1986" "-1970" "-1958" "-1986" ## [36] "-1973" "-1983" "-1966" "-1950" "-1953" ## [41] "-1963" "-1966" "-1973" "(I) (2011)" "-2009" ## [46] "-2012" "-1993" "-1960" "-2004" "-1972" ``` -- - There are two formats: `-YYYY` and `(I) (YYYY)`. We need to keep only the `YYYY` part to convert it to numeric. - What do `str_length()` and `str_sub()` do? --- 1) Fix the year variable. This can be done in multiple ways! Try with `str_length()` and `str_sub()` 2) Create a variable that indicates whether a movie is older or more recent than the average year in the sample. ```r imdb <- imdb %>% # Fix year variable mutate(year = if_else(str_length(Year) == 5, # If format is "-YYYY", str_sub(Year, 2, 5), # extract positions 2-5 str_sub(Year, 6, 9)), # otherwise it's "(I) (YYYY)" # so, keep positions 6-9 year = as.numeric(year)) %>% # Convert to numeric # ``` --- 1) Fix the year variable. This can be done in multiple ways! Try with `str_length()` and `str_sub()` 2) Create a variable that indicates whether a movie is older or more recent than the average year in the sample. 
```r
imdb <- imdb %>%
  # Fix year variable
  mutate(year = if_else(str_length(Year) == 5,  # If format is "-YYYY",
                        str_sub(Year, 2, 5),    # extract positions 2-5
                        str_sub(Year, 6, 9)),   # otherwise it's "(I) (YYYY)"
                                                # so, keep positions 6-9
         year = as.numeric(year)) %>%           # Convert to numeric
  # Create an "older than average" indicator
  mutate(avg_year = mean(year),                 # Calculate average year
         older_than_avg = year < avg_year) %>%  # Create indicator
  select(Name, year, avg_year, older_than_avg, Rating)

head(imdb)
```

--

```
##                        Name year avg_year older_than_avg Rating
## 1                     Shoah 1985 1983.936          FALSE    8.7
## 2                      Home 2009 1983.936          FALSE    8.5
## 3               Untouchable 2011 1983.936          FALSE    8.5
## 4                   Le Trou 1960 1983.936           TRUE    8.5
## 5 The Man Who Planted Trees 1987 1983.936          FALSE    8.5
## 6      Children of Paradise 1945 1983.936           TRUE    8.3
```

---

### OTHER WAYS TO FIX YEAR FROM YOUR ANSWERS

<br>

Using **regular expressions**, a concise language for describing patterns in strings. See a [cheatsheet on how to use them with package {stringr} here](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf).

```r
imdb %>% mutate(Year = ...)

str_extract(Year, "[:digit:]{4}")     # Extract exactly 4 digits
str_replace_all(Year, "[()I-]", "")   # Replace the characters ()I- with nothing
gsub("[^0-9.]+", "", Year)            # Remove anything that is not 0-9 or "." (one or more)
str_sub(parse_number(Year), -4, -1)   # Keep only the number, then its last 4 characters
```

---

3) Create a table with the average rating by decade (start from the 50s).

--

```r
imdb %>%
  # Create decade indicator
  mutate(decade = case_when(year %in% c(1950:1959) ~ "50s",
                            year %in% c(1960:1969) ~ "60s",
                            year %in% c(1970:1979) ~ "70s",
                            year %in% c(1980:1989) ~ "80s",
                            year %in% c(1990:1999) ~ "90s",
                            year %in% c(2000:2009) ~ "2000s",
                            year %in% c(2010:2019) ~ "2010s",
                            year %in% c(2020:2029) ~ "2020s",
                            .default = "40s or older")) %>%
  #
```

---

3) Create a table with the average rating by decade (start from the 50s).
```r
imdb %>%
  # Create decade indicator
  mutate(decade = case_when(year %in% c(1950:1959) ~ "50s",
                            year %in% c(1960:1969) ~ "60s",
                            year %in% c(1970:1979) ~ "70s",
                            year %in% c(1980:1989) ~ "80s",
                            year %in% c(1990:1999) ~ "90s",
                            year %in% c(2000:2009) ~ "2000s",
                            year %in% c(2010:2019) ~ "2010s",
                            year %in% c(2020:2029) ~ "2020s",
                            .default = "40s or older")) %>%
  # Create a table with average rating by decade
  group_by(decade) %>%
  summarise(avg_rating = mean(Rating))
  #
```

--

```
## # A tibble: 8 × 2
##   decade       avg_rating
##   <chr>             <dbl>
## 1 2000s              7.54
## 2 2010s              7.59
## 3 40s or older       7.79
## 4 50s                7.83
## 5 60s                7.70
## 6 70s                7.56
## 7 80s                7.66
## 8 90s                7.52
```

---

<!-- Full -->

3) Create a table with the average rating by decade (start from the 50s).

```r
imdb %>%
  # Create decade indicator
  mutate(decade = case_when(year %in% c(1950:1959) ~ "50s",
                            year %in% c(1960:1969) ~ "60s",
                            year %in% c(1970:1979) ~ "70s",
                            year %in% c(1980:1989) ~ "80s",
                            year %in% c(1990:1999) ~ "90s",
                            year %in% c(2000:2009) ~ "2000s",
                            year %in% c(2010:2019) ~ "2010s",
                            year %in% c(2020:2029) ~ "2020s",
                            .default = "40s or older")) %>%
  # Create a table with average rating by decade
  group_by(decade) %>%
  summarise(avg_rating = mean(Rating)) %>%
* arrange(-avg_rating)
```

```
## # A tibble: 8 × 2
##   decade       avg_rating
##   <chr>             <dbl>
## 1 50s                7.83
## 2 40s or older       7.79
## 3 60s                7.70
## 4 80s                7.66
## 5 2010s              7.59
## 6 70s                7.56
## 7 2000s              7.54
## 8 90s                7.52
```

---

### OTHER WAYS TO CREATE THE DECADE FROM YOUR ANSWERS

<br>

```r
imdb %>% mutate(decade = ...)
# These all use the cleaned, numeric `year` variable created earlier
paste(str_sub(year, start = 3, end = 3), "0s", sep = "")  # Get the third character and add "0s"
as.numeric(substr(year, start = 1, stop = 3)) * 10         # Get the first three digits of the year and multiply by 10
floor(year / 10) * 10                                      # Divide the year by 10 and round down
year %/% 10 * 10                                           # Integer division: keep only the quotient
```

---
class: inverse

# WARM-UP USING {dplyr}

<br>

- Import `02_taylor-swift-spotify.csv` (originally obtained [here](https://www.kaggle.com/datasets/jarredpriester/taylor-swift-spotify-dataset?resource=download)) and `View()` the data
- Inspect the structure of the data using `glimpse()`
- Use `summarise()` to compute, for each album, the average danceability and the number of songs included (the data has 1 row per song)
- Create a subset of the data called `maxpop` containing the variables `album`, `release_date`, `danceability` and `popularity` for the 10 most popular songs. Use the functions `arrange()` and `row_number()`
---

### WARM-UP

- Import `02_taylor-swift-spotify.csv` (originally obtained [here](https://www.kaggle.com/datasets/jarredpriester/taylor-swift-spotify-dataset?resource=download)) and `View()` the data
- Inspect the structure of the data using `glimpse()`

--

```r
ts <- read.csv(here("data", "02_taylor-swift-spotify.csv"))
glimpse(ts)
```

```
## Rows: 487
## Columns: 21
## $ X            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ name         <chr> "Mine (Taylor's Version)", "Sparks Fly (Taylor’s Ve…
## $ album        <chr> "Speak Now", "Speak Now", "Speak Now", "Speak Now",…
## $ release_date <chr> "2023-07-07", "2023-07-07", "2023-07-07", "2023-07-…
## $ track_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ id           <chr> "7G0gBu6nLdhFDPRLc0HdDG", "3MytWN8L7shNYzGl4tAKRp",…
## $ uri          <chr> "spotify:track:7G0gBu6nLdhFDPRLc0HdDG", "spotify:tr…
## $ acousticness <dbl> 0.004440, 0.025100, 0.006210, 0.248000, 0.023600, 0…
....
```

---

### WARM-UP

- Use `summarise()` to compute, for each album, the average danceability and the number of songs included (the data has 1 row per song)

--

```r
ts %>%
  group_by(album) %>%
  summarise(avg_danceability = mean(danceability),
            songs = n())
```

```
## # A tibble: 11 × 3
##   album        avg_danceability songs
##   <chr>                   <dbl> <int>
## 1 1989                    0.640    32
## 2 Fearless                0.569    58
## 3 Lover                   0.658    18
## 4 Midnights               0.628    56
## 5 Red                     0.606    68
## 6 Speak Now               0.527    72
## 7 Taylor Swift            0.545    15
## 8 evermore                0.523    32
## 9 folklore                0.553    67
....
```

---

### WARM-UP

- Create a subset of the data called `maxpop` containing the variables `album`, `release_date`, `danceability` and `popularity` for the 10 most popular songs.
Use function `arrange()` and `row_number()` -- ```r maxpop <- ts %>% arrange(-popularity) %>% # Order by popularity (descending) select(album, release_date, danceability, popularity) %>% filter(row_number() <= 10) # Keep first 10 rows maxpop ``` ``` ## album release_date danceability popularity ## 1 Lover 2019-08-23 0.552 100 ## 2 Midnights 2022-10-21 0.637 93 ## 3 folklore 2020-07-24 0.532 93 ## 4 Speak Now 2023-07-07 0.694 91 ## 5 folklore 2020-07-24 0.613 91 ## 6 Lover 2019-08-23 0.359 91 ## 7 reputation 2017-11-10 0.615 91 ## 8 Midnights 2022-10-21 0.642 90 ## 9 Speak Now 2023-07-07 0.505 89 ## 10 Speak Now 2023-07-07 0.497 88 ``` --- # AGENDA .nunitosm[ .pull-left[ .nunitosmgrey[] - AN INTRO TO DATA VISUALIZATION - THE ggplot() FUNCTION 1. MAIN STRUCTURE 2. AESTHETICS 3. ADDING DIMENSIONS 4. MAPPING vs. SETTING ATTRIBUTES 5. SAVING ] .pull-right[ - BROWSING THE TOOLBOX 1. LABELS 2. ANNOTATIONS 3. ADDING LAYERS 4. GROUPING DATA 5. STATISTICAL SUMMARIES - WHICH GRAPH SHOULD I USE? - GRAPHICAL EXCELLENCE AND INTEGRITY ] ] --- # DATA VISUALIZATION First known graphics applied to economic data were produced by William Playfair (1759-1823), a Scottish political economist in his book *The Commercial and Political Atlas* <sup>1</sup> <img src="data:image/png;base64,#inputs/playfair-trade-balance.png" width="60%" style="display: block; margin: auto;" /> .footnote[ [1]Tilling, 1975 as cited in Tufte, 2007. ] --- class: middle > Information, that is imperfectly acquired, is generally imperfectly retained; > and a man (*person*) who has carefully investigated **a printed table**, finds, when done, that (s)he has **only a very faint and partial idea** of what (s)he has read; and that like a figure imprinted on sand, is **soon totally erased and defaced**. [...] 
> On **inspecting any one of these Charts** attentively, a **sufficiently distinct impression** will be made, to **remain unimpaired for a considerable time**, and the idea which does remain will be **simple and complete** [...] > <footer>--- William Playfair in The Commercial and Political Atlas</footer> --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-lyon.png" width="80%" style="display: block; margin: auto;" /> --- <br> <br> .center[ ## Now, using R with the {ggplot2} package and the current Paris - Marseille schedule ] --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-marseille.png" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-marseille-traincol.png" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-marseille-dircol.png" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-marseille-direct.png" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="data:image/png;base64,#inputs/marey-trains-paris-marseille-layover.png" width="80%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/9a306c0a-dac8-413d-ba2e-cc7fd4c4d5c8_rw_1920.png?h=c802991088a9623f1f7aa18c470797ee" width="80%" style="display: block; margin: auto;" /> --- ## PREPARING THE DATA FOR TODAY We are going to use data from the [World Inequality Database](https://wid.world/). 
```r
wid <- read.csv(here("data", "02_wid.csv"))
glimpse(wid)
```

```
## Rows: 1,610
## Columns: 6
## $ country   <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Algeria", "Alge…
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ year      <int> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, …
## $ fshare    <dbl> 0.0992, 0.1120, 0.1201, 0.1206, 0.1160, 0.1221, 0.1232, 0.12…
## $ top1      <dbl> 0.1003, 0.0991, 0.0991, 0.0991, 0.0991, 0.0991, 0.0991, 0.09…
## $ inc_head  <dbl> 12610.627, 12619.984, 12634.026, 12531.988, 12546.430, 12533…
```

**What is the observation level?**

--

Country-year

--

Variables:

- `fshare` Female labor income share
- `top1` Top 1% income share
- `inc_head` Income per capita (adults)

---

# ggplot() BASICS

.pull-left[
Based on *The Grammar of graphics* (Wilkinson, 2005).<br>

What is a statistical graphic? A mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).

`ggplot()` creates a canvas to draw on.
]

.pull-right[
```r
#install.packages("ggplot2")
library(ggplot2)
```

```r
ggplot()
```

<!-- -->
]

---

# ggplot() BASICS

.pull-left[
Based on *The Grammar of graphics* (Wilkinson, 2005). <br>

What is a statistical graphic? A mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).

`ggplot()` creates a canvas to draw on. Then we add:

<span class="emphasized">🔢 🔡 Data:</span> specifies the dataframe with the **values** to plot
]

.pull-right[
```r
#install.packages("ggplot2")
library(ggplot2)
```

```r
ggplot(data = wid, # Data
```

<!-- -->
]

---

# ggplot() BASICS

.pull-left[
Based on *The Grammar of graphics* (Wilkinson, 2005). <br>

What is a statistical graphic? A mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).

`ggplot()` creates a canvas to draw on.
Then we add: <span class="emphasized">🔢 🔡 Data:</span> specifies the dataframe with the **values** to plot <span class="emphasized">📐 🖌️Aesthetic mappings:</span > relate variables in the data to visual characteristics of the plot. ] .pull-right[ ```r #install.packages("ggplot2") library(ggplot2) ``` ```r ggplot(data = wid, # Data aes(x = inc_head, # Aesthetics y = top1)) + ``` <!-- --> ] --- # ggplot() BASICS .pull-left[ Based on *The Grammar of graphics* (Wilkinson, 2005). <br> What is a statistical graphic? A mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). `ggplot()` creates a canvas to draw on. Then we add: <span class="emphasized">🔢 🔡 Data:</span> specifies the dataframe with the **values** to plot <span class="emphasized">📐 🖌️Aesthetic mappings:</span > relate variables in the data to visual characteristics of the plot. <span class="emphasized">📈 📊 Geometries:</span> describe how to render each observation. Can be layered to have multiple representations of the data. ] .pull-right[ ```r #install.packages("ggplot2") library(ggplot2) ``` ```r ggplot(data = wid, # Data aes(x = inc_head, # Aesthetics y = top1)) + geom_point() # Geometry ``` <!-- --> ] --- # ggplot() BASICS .pull-left[ Based on *The Grammar of graphics* (Wilkinson, 2005). <br> What is a statistical graphic? A mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). `ggplot()` creates a canvas to draw on. Then we add: <span class="emphasized">🔢 🔡 Data:</span> specifies the dataframe with the **values** to plot <span class="emphasized">📐 🖌️Aesthetic mappings:</span > relate variables in the data to visual characteristics of the plot. <span class="emphasized">📈 📊 Geometries:</span> describe how to render each observation. Can be layered to have multiple representations of the data. 
] .pull-right[ ```r #install.packages("ggplot2") library(ggplot2) ``` ```r ggplot(data = wid, # Data aes(x = inc_head, # Aesthetics y = top1)) + geom_point() # Geometry ``` <!-- --> ] --- ## MAIN STRUCTURE - Data and mapping should be specified in parentheses - The first two arguments in `aes()` are almost always x,y - Geometry and other elements should be added with `+` ```r ggplot(wid, aes(inc_head, top1)) + geom_point() ``` -- <span class="emphasized"> Remember the pipe? </span> - We can also apply `ggplot()` to our data with the pipe `%>%` OR `|>` -- ```r wid %>% # Data ggplot(aes(x = inc_head, # Aesthetics y = top1)) + geom_point() # Geometry ``` --- ### Can you guess the plots? ```r wid %>% ggplot(aes(top1, inc_head)) + geom_point() ``` -- <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" /> --- ### Can you guess the plots? ```r wid %>% ggplot(aes(inc_head, fshare)) + geom_point() ``` -- <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" /> --- ### Can you guess the plots? ```r wid %>% filter(country == "Italy") %>% ggplot(aes(year, fshare)) + geom_line() ``` -- <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-57-1.png" style="display: block; margin: auto;" /> --- ### Can you guess the plots? ```r wid %>% filter(year == "2010") %>% ggplot(aes(inc_head)) + geom_histogram() ``` -- ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-59-1.png" style="display: block; margin: auto;" />

---

# AESTHETICS: .greenhead[**AXES**]

The quickest functions for the most common modifications:

.pull-left[
```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point() +
* xlab("Income per adult") +
* ylab("Income share among top 1%") +
* ylim(0.2, 0.3)
```

<!-- -->
]

.pull-right[
```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point() +
* xlab(NULL) +
* ylab("") +
* xlim(NA, 50000)
```

<!-- -->
]

---

# AESTHETICS .greenhead[**THEME**]

- The theme system gives you control over the appearance of all the **non-data** elements of the plot
- You can use any of the **built-in ggplot2 themes**. You can also set the **base font size** using `base_size`

```r
... + theme_gray(base_size = 10) #<< # The default theme
```

--

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-65-1.png" width="25%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-65-2.png" width="25%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-65-3.png" width="25%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-65-4.png" width="25%" />

--

- You can customize the graph completely using the `theme()` function
- See `?theme` for the endless possibilities and [Chapter 8 of Wickham (2016)](https://ggplot2-book.org/themes)

---

# AESTHETICS .greenhead[**GEOM ATTRIBUTES**]

We can set the appearance of our observations by specifying their shape, color, fill, size/width, and transparency (*alpha*).
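As a rough illustration of these attributes, here is a sketch on simulated data (the data frame `toy` and the object name `p_attr` are made up for this example). Every attribute is set to a fixed value inside `geom_point()`:

```r
library(ggplot2)

# Simulated data, only for illustration
set.seed(123)
toy <- data.frame(x = rnorm(200), y = rnorm(200))

p_attr <- ggplot(toy, aes(x, y)) +
  geom_point(shape = 21,          # Circle with a border and a fill
             color = "darkblue",  # Border color
             fill = "lightblue",  # Inside color
             size = 3,            # Point size
             alpha = 0.5)         # Transparency (0 = invisible, 1 = opaque)

p_attr
```

Shape 21 is used here because it is one of the point shapes that has both a border (`color`) and an interior (`fill`); for most other shapes only `color` has an effect.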
**Aesthetic attributes can make or break a plot:** ```r df <- data.frame(x = rnorm(15000), y = rnorm(15000)) norm <- df %>% ggplot(aes(x,y)) norm + geom_point() ``` <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-67-1.png" width="28%" style="display: block; margin: auto;" /> --- ### How to deal with overplotting? <span class="emphasized">Changing the size</span> ```r par(mar = c(3, 3, .1, .1)) norm + geom_point(size = 3) norm + geom_point(size = 2) norm + geom_point(size = 1) ``` <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-68-1.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-68-2.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-68-3.png" width="33%" /> --- ### How to deal with overplotting? <span class="emphasized">Changing the shape</span> ```r par(mar = c(3, 3, .1, .1)) norm + geom_point(size =3) norm + geom_point(size = 3, * shape = 1) norm + geom_point(size = 3, * shape = "X") ``` <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-69-1.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-69-2.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-69-3.png" width="33%" /> See `vignette("ggplot2-specs")`for the values you can use for these and other aesthetics --- ### How to deal with overplotting? 
<span class="emphasized">Changing the transparency</span>

```r
par(mar = c(3, 3, .1, .1))
norm + geom_point(size = 3)
norm + geom_point(size = 3,
*                 alpha = 1 / 2)
norm + geom_point(size = 3,
*                 alpha = 0.1)
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-70-1.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-70-2.png" width="33%" /><img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-70-3.png" width="33%" />

---

## ADDING DIMENSIONS

So far, we've added at most two variables to our plots. We can add more variables using aesthetics like the **color, shape, and size** of the geometries.

.pull-left[
```r
wid %>%
  ggplot(aes(inc_head, top1,
*            color = continent)) +
  geom_point()
```

<!-- -->
]

.pull-right[
]

---

## ADDING DIMENSIONS

So far, we've added at most two variables to our plots. We can add more variables using aesthetics like the **color, shape, and size** of the geometries. We are **mapping** variable values into visual attributes.

.pull-left[
```r
wid %>%
  ggplot(aes(inc_head, top1,
*            color = continent)) +
  geom_point()
```

<!-- -->
]

.pull-right[
- ggplot2 takes care of the details of converting data ("Europe", "Africa") to aesthetics (green, red) with an automatic **scale**
- When the mapped variable is continuous, a gradient is used instead
- The scale can be overridden and customized. [See Chapter 6 of ggplot2 book for more details. ](https://ggplot2-book.org/scales-colour)
- `scale_color_discrete()`
- `scale_color_manual(values = c("red", "blue","#000000"))`
]

---

## ADDING DIMENSIONS

```r
wid %>%
  ggplot(aes(inc_head, top1,
*            color = fshare)) +
  geom_point(alpha = 0.3)
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-76-1.png" style="display: block; margin: auto;" />

---

## Setting vs. mapping visual attributes

.pull-left[
.nunito[**Setting:**] Defining a **fixed** value

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point(color = "darkblue")
```

<!-- -->
]

.pull-right[
.nunito[**Mapping:**] Relating an aesthetic to a **variable**

```r
wid %>%
  ggplot(aes(inc_head, top1,
*            color = continent)) +
  geom_point()
```

```r
# Equivalent to this:
wid %>%
  ggplot(aes(inc_head, top1)) +
* geom_point(aes(color = continent))
```

<!-- -->
]

---

### What about this?

.pull-left[
<br>

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
* geom_point(aes(color = "darkblue"))
```

<!-- -->
]

.pull-left[
]

---

### What about this?

.pull-left[
<br>

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
* geom_point(aes(color = "darkblue"))
```

<!-- -->
]

.pull-left[
- This is mapping the value "darkblue" to a color.
- It temporarily creates a new variable containing only the value "darkblue" and then scales it with a color scale.

```r
wid %>%
  mutate(colour = "darkblue") %>%
  ggplot(aes(inc_head, top1, color = colour)) +
  geom_point()
```

- We'll learn how it can be useful in a few slides
]

---

### FACETTING

.pull-left[
- Another technique for displaying additional categorical variables on a plot is facetting.
```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~continent)
```

- It creates tables of plots by splitting the data into subsets and displaying different graphs for each subset
- Choose the facet arrangement using `nrow` and `ncol` to indicate the number of rows and columns
- Specify the type of scale you want:
  - **free**: adjusted separately to each facet (`free_x` and `free_y` free only one axis)
  - **fixed**: common to all
]

.pull-right[
<!-- -->
]

---

### FACETTING

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~continent,
*            ncol = 6,
*            scales = "free_x")
```

<!-- -->

---

<br>
<br>

## SAVING

- By default, `ggsave()` saves the last plot displayed (what is in the Plots tab)

```r
ggsave(here("outputs", "myplot.png")) # Saves what is in the Plots tab

ggsave(here("outputs", "myplot.png"), # Where
       plot = last_plot(),            # What
       width = 16,                    # Dimensions
       height = 9,
       units = "cm")                  # Units
```

---

# AGENDA

.nunitosm[
.pull-left[
.nunitosmgrey[
- AN INTRO TO DATA VISUALIZATION
- THE ggplot() FUNCTION
  1. MAIN STRUCTURE
  2. AESTHETICS
  3. ADDING DIMENSIONS
  4. MAPPING vs. SETTING ATTRIBUTES
  5. SAVING
]
]

.pull-right[
- BROWSING THE TOOLBOX
  1. ADDING LAYERS
  2. LABELS
  3. ANNOTATIONS
  4. GROUPING DATA
  5. STATISTICAL SUMMARIES
- WHICH GRAPH SHOULD I USE?
- GRAPHICAL EXCELLENCE AND INTEGRITY
]
]

---

### BROWSING THE TOOLBOX

There is a lot. Really **A LOT** you can do with ggplot.

<br>

Let's illustrate these tools by creating a modern `ggplot()` version of one of Playfair's trade balance graphs:

<br>

- Download and import `02_playfair-balance.csv`

<img src="data:image/png;base64,#inputs/02_playfair-trade-balance-nordic.png" width="65%" style="display: block; margin: auto;" />

---

### ADDING MORE LAYERS

.pull-left[
We can keep adding geometries to build our plot.

- Every layer we add must have some data associated with it
- The data on each layer doesn't need to be the same. We can specify the mappings/aesthetics of each geom (data, x, y) separately.
- A new geom inherits the aesthetics set inside `ggplot(aes())` unless you override them in the geom
- It doesn't hurt to be explicit about which data you are using when you are handling multiple layers
]

.pull-right[

```r
balance <- read.csv("../data/raw/02_playfair-balance.csv")
```

```r
balance %>%
  ggplot(aes(x = year, y = exports)) +
* geom_line(aes(y = exports), # First layer
            color = "red") +
* geom_line(aes(y = imports), # Second layer
            color = "orange") +
  theme_minimal(base_size = 20)
```

<!-- -->
]

---

```r
balance %>%
  ggplot(aes(x = year, y = exports)) +
* geom_line(aes(y = exports), # First layer
            color = "red") +
  theme_minimal(base_size = 20)
```

<!-- -->

---

```r
balance %>%
  ggplot(aes(x = year, y = exports)) +
* geom_line(aes(y = exports), # First layer
            color = "red") +
* geom_line(aes(y = imports), # Second layer
            color = "gold") +
  theme_minimal(base_size = 20)
```

<!-- -->

---

.nunito[What happens if we don't specify new mappings for each geom?]

```r
balance %>%
  ggplot(aes(x = year, y = exports)) +
  geom_line(color = "red") +
  geom_line(color = "gold") +
  theme_minimal(base_size = 20)
```

<!-- -->

---

### LABELING EACH LAYER

```r
balance %>%
  ggplot(aes(x = year, y = exports)) +
  geom_line(aes(y = exports,
*               color = "Exports")) +
  geom_line(aes(y = imports,
*               color = "Imports")) +
  theme_minimal() +
  scale_color_manual(values = c("red", "orange"),
                     name = NULL)
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-100-1.png" style="display: block; margin: auto;" />

---

Let's save this plot to add more things to it later:

```r
*balance_plot <- balance %>%
  ggplot(aes(x = year, y = exports)) +
  geom_line(aes(y = exports, color = "Exports")) +
  geom_line(aes(y = imports, color = "Imports")) +
  theme_minimal() +
  scale_color_manual(values = c("red", "orange"),
                     name = NULL)
```

---

## LABELS

- Adding text to a plot. Your obvious best friend: `geom_text()`. Same as `geom_point()` but with text labels instead of points.
Duh 🙄

For example, with our WID data:

```r
wid %>%
  filter(year == 2019 & continent == "Europe") %>%
  ggplot(aes(x = inc_head, y = top1)) +
* geom_text(aes(label = country), alpha = 0.7)
```

<!-- -->

---

### LABELS

Now, let's use it with our trade balance plot to highlight three points:

```r
bal_labels <- # Keep only the data for three years
  balance %>%
  filter(year %in% c(1702, 1740, 1764))
```

--

```r
balance_plot +
* geom_point(data = bal_labels) +
  geom_text(aes(x = year, y = exports, label = exports),
*           data = bal_labels)
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-106-1.png" style="display: block; margin: auto;" />

---

Fixing the overlap by nudging the labels away from the points:

```r
bal_labels <- balance %>%
  filter(year %in% c(1702, 1740, 1764)) # Keep only these
```

```r
balance_plot +
  geom_point(data = bal_labels) +
  geom_text(aes(x = year, y = exports, label = exports),
            data = bal_labels,
*           nudge_y = 4,
*           nudge_x = -2)
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-109-1.png" style="display: block; margin: auto;" />

---

### ANNOTATIONS

- Annotations add metadata to your plot to **highlight** certain features. You can use:
  - `geom_text()` to add text descriptions or label points (e.g. outliers)
  - `geom_rect()` to highlight rectangular regions of the plot
  - `geom_line()`, `geom_path()`, `geom_segment()` to add lines
  - `geom_vline()`, `geom_hline()` to add rulers (lines that span the whole graph)

---

### ANNOTATIONS

.pull-left[

```r
balance_plot +
  geom_vline(xintercept = 1707,
             linetype = "dashed")
```

<br>
<br>

<!-- -->
]

.pull-right[

```r
balance_plot +
  geom_vline(xintercept = 1707,
             linetype = "dashed") +
  annotate("text", x = 1716, y = 150,
           label = "Acts of Union between \n England and Scotland")
```

<!-- -->
]

---
class: inverse

### EXERCISE

- Reproduce this graph with the Taylor Swift data. Hint: add `scale_shape_manual(values = c(1, 16))` at the end.
<br> <img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-114-1.png" width="1080" />
---

```r
ts %>%
  ggplot(aes(x = popularity, y = album))
```

```r
#
#
#
#
#
#
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-116-1.png" width="1080" />

---

```r
ts %>%
  ggplot(aes(x = popularity, y = album)) +
  geom_point(size = 6)
```

```r
#
#
#
#
#
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-118-1.png" width="1080" />

---

```r
ts %>%
  ggplot(aes(x = popularity, y = album,
             shape = as.factor(is_taylors_version),
             alpha = danceability)) +
  geom_point(size = 6)
```

```r
#
#
#
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-120-1.png" width="1080" />

---

```r
ts %>%
  ggplot(aes(x = popularity, y = album,
             shape = as.factor(is_taylors_version),
             alpha = danceability)) +
  geom_point(size = 6) +
  scale_shape_manual(values = c(15, 16))
```

```r
#
#
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-122-1.png" width="1080" />

---

```r
ts %>%
  ggplot(aes(x = popularity, y = album,
             shape = as.factor(is_taylors_version),
             alpha = danceability)) +
  geom_point(size = 6) +
  scale_shape_manual(values = c(15, 16)) +
  theme_minimal(base_size = 20)
```

```r
#
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-124-1.png" width="1080" />

---

```r
ts %>%
  ggplot(aes(x = popularity, y = album,
             shape = as.factor(is_taylors_version),
             alpha = danceability)) +
  geom_point(size = 6) +
  scale_shape_manual(values = c(15, 16)) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom")
```

```r
#
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-126-1.png" width="1080" />

---

### Statistical summaries

So far, we've only seen **individual geoms**, where there is a graphical object drawn for each observation (e.g. `geom_point()` draws one point per row).

A **collective geom** displays multiple observations with one geometric object.
Collective geoms are often used to display grouped summary statistics (e.g. the average popularity by album) or variable distributions through boxplots or histograms/densities.

.pull-left[
<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-127-1.png" width="576" />
]

.pull-right[

```r
ts %>%
  # Create summary statistic
  group_by(album) %>%
  summarise(popularity = mean(popularity)) %>%
  # Plot
  ggplot(aes(x = album, y = popularity)) +
  geom_bar(stat = "identity") +
  # Rotate axis labels
  theme(axis.text.x = element_text(angle = -40, vjust = 1, hjust = 0))
```

]

---

### Statistical summaries

So far, we've only seen **individual geoms**, where there is a graphical object drawn for each observation (e.g. `geom_point()` draws one point per row).

A **collective geom** displays multiple observations with one geometric object. Collective geoms are often used to display grouped summary statistics (e.g. the average popularity by album) or variable distributions through boxplots or histograms/densities.

.pull-left[
<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-129-1.png" width="576" />
]

.pull-right[

```r
ts %>%
  filter(!is.na(album)) %>%
  ggplot(aes(x = album, y = popularity)) +
  geom_violin(fill = "lightblue", alpha = 0.5) +
  geom_boxplot(width = .1, alpha = 0.4)
```

]

---

## HOW TO CHOOSE THE RIGHT PLOT?
- **Do you want to see all the observations or summarize the data?**
- **Are your x and y categorical or continuous?**

See the recommendations in the [ggplot2 cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) or this [decision tree by Yan Holtz & Conor Healy](https://www.data-to-viz.com/)

--

<br>

Look into visualization galleries:

- [The R graph gallery](https://r-graph-gallery.com/) for examples of use cases and inspiration on aesthetics
- [50 ggplot2 visualizations with R code](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)
- Examples from economics research from DIME Analytics [Econ Visual Library in R](https://worldbank.github.io/r-econ-visual-library/index.html) and [Stata](https://worldbank.github.io/stata-visual-library/)

<br>

--

**BEST RESOURCE** for everything related to `ggplot()`

- [ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham](https://ggplot2-book.org/).

---

### GRAPHICAL EXCELLENCE

.greenhead[4 qualities of great visualizations]

.nunito[**1. HONEST**]

<img src="data:image/png;base64,#./inputs/02_tesla-netflix-stock.png" width="50%" style="display: block; margin: auto;" />

--

Being careless with scales and axes is dangerous

- Zooming in or out on a graph can be very misleading. Start your axis at 0 (most of the time).
- Double axes can say whatever you want them to say. Plot things on the same scale.

---

.greenhead[4 qualities of great visualizations]

.nunito[**2.
FUNCTIONAL**]

- Your graph should convey information accurately but also **help the audience interpret the data correctly**
- It should be understood without additional context (label axes and add proper legends)

.pull-left[
<br>
<br>
<br>
<img src="data:image/png;base64,#./inputs/02_music-pie.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#./inputs/02_music-slope.png" width="60%" style="display: block; margin: auto;" />
]

---

.greenhead[4 qualities of great visualizations]

.nunito[**3. BEAUTIFUL**]

- Your graph should be attractive and **✨aesthetically pleasing✨**. Declutter!

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-135-1.png" width="1440" style="display: block; margin: auto;" />

---

.greenhead[4 qualities of great visualizations]

.nunito[**3. BEAUTIFUL**]

- Your graph should be attractive and **✨aesthetically pleasing✨**. Declutter!

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-136-1.png" width="1440" style="display: block; margin: auto;" />

---

.greenhead[4 qualities of great visualizations]

.nunito[**4. INSIGHTFUL**]

- A graphic should reveal evidence that we would have a hard time seeing otherwise. The purpose of visualization is insight, not pictures.
- From the creators of "*meetings that could have been an e-mail*", we have: "*graphs that could have been a simple table or sentence*"

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-137-1.png" width="576" style="display: block; margin: auto;" />

---

### HOMEWORK

Submit using [this link](https://classroom.github.com/a/VrRecub6)

- Use the `02_playfair-wages-wheat.csv` dataset and replicate this graph as best as you can using `ggplot()`

<img src="data:image/png;base64,#inputs/playfair-wages-wheat.png" width="50%" style="display: block; margin: auto;" />

- 🆗 OK: Plotting wheat prices, wages and the timeline of English rulers in the right geoms and colors
- 👍🏽 Great: Getting the axes (you might need `dup_axis()`) and the overall appearance of the geoms as similar as you can, as well as including the annotation in the middle of the graph ("Chart showing...")
- 🤩 Amazing: Adding the label over the wages series ("Weekly wages of a good mechanic"), customizing the appearance of the grid and including the labels of the English rulers

---

### HOMEWORK

Submit using [this link](https://classroom.github.com/a/VrRecub6)

- Use the `02_playfair-wages-wheat.csv` dataset and replicate this graph as best as you can using `ggplot()`

<img src="data:image/png;base64,#inputs/playfair-wages-wheat.png" width="50%" style="display: block; margin: auto;" />

After completing this, or after at least **30 minutes of trying on your own**, you may use an AI tool (e.g., ChatGPT) to get help. Add a short reflection at the bottom of your script:

1. What prompt did you give to the AI? (Paste it)
2. What kind of answer did it give you? Was it helpful? Why or why not?
3. How did you adapt or modify what the AI gave you, if at all?

Include your R script (with AI reflection) and final image

---

### SLIDES THAT DIDN'T MAKE THE CUT

- Axes can be modified with **scale functions**.
The following parameters can be specified:
  - `name`: the label of the axis
  - `limits`: where the axis starts and ends
  - `breaks`: where to put ticks and values on the axis

.pull-left[

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point()
  #
```

]

.pull-right[

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous()
```

]

---

## Axes

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point()
  #
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-143-1.png" width="60%" />

---

## Axes

```r
wid %>%
  ggplot(aes(inc_head, top1)) +
  geom_point() +
  scale_x_continuous(name = "Income per adult", limits = c(0, 150000)) +
  scale_y_continuous(name = "Share of income among top 1%")
```

<img src="data:image/png;base64,#02_data-visualization_files/figure-html/unnamed-chunk-144-1.png" width="60%" />
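
---

## Axes

The `breaks` parameter isn't used in the examples above. Here is a minimal sketch of how it could be combined with `name` and `limits`, assuming a small made-up data frame (`toy` and its values are invented for illustration and are not the real `wid` data):

```r
library(ggplot2)

# Hypothetical stand-in for the wid data: these values are made up
toy <- data.frame(
  inc_head = c(5000, 20000, 45000, 80000, 120000),
  top1     = c(0.22, 0.18, 0.12, 0.10, 0.14)
)

p <- ggplot(toy, aes(inc_head, top1)) +
  geom_point() +
  scale_x_continuous(name   = "Income per adult",
                     limits = c(0, 150000),                  # Where the axis starts and ends
                     breaks = seq(0, 150000, by = 50000)) +  # Ticks every 50,000
  scale_y_continuous(name   = "Share of income among top 1%",
                     breaks = c(0.10, 0.15, 0.20))           # Ticks at hand-picked values

p
```

Keep in mind that `limits` in a scale function drops the observations outside the range (with a warning). If you only want to zoom in without dropping data, `coord_cartesian(xlim = c(0, 150000))` is the usual alternative.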